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Abstract 

Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in 
data for regression, classification and clustering. For continuous data which we con¬ 
sider here in the context of regression and cluster analysis, MoE usually use normal 
experts, that is, expert components following the Gaussian distribution. However, 
for a set of data containing a group or groups of observations with asymmetric 
behavior, heavy tails or atypical observations, the use of normal experts may be un¬ 
suitable and can unduly affect the fit of the MoE model. In this paper, we introduce 
new non-normal mixture of experts (NNMoE) which can deal with these issues re¬ 
garding possibly skewed, heavy-tailed data and with outliers. The proposed models 
are the skew-normal MoE and the robust t MoE and skew t MoE, respectively named 
SNMoE, TMoE and STMoE. We develop dedicated expectation-maximization (EM) 
and expectation conditional maximization (ECM) algorithms to estimate the pa¬ 
rameters of the proposed models by monotonically maximizing the observed data 
log-likelihood. We describe how the presented models can be used in prediction and 
in model-based clustering of regression data. Numerical experiments carried out on 
simulated data show the effectiveness and the robustness of the proposed models in 
terms modeling non-linear regression functions as well as in model-based clustering. 

Then, to show their usefulness for practical applications, the proposed models are 
applied to the real-world data of tone perception for musical data analysis, and the 
one of temperature anomalies for the analysis of climate change data. 

keywords: mixture of experts, skew normal distribution, t distribution, skew t distribu¬ 
tion, EM algorithm, ECM algorithm, non-linear regression, model-based clustering 
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1 Introduction 


Mixture of experts (MoE) introduced by (Jacobs et al., 1991) are widely studied in statis¬ 
tics and machine learning. They consist in a fully conditional mixture model where both 
the mixing proportions, known as the gating functions, and the component densities, 
known as the experts, are conditional on some input covariates. MoE have been investi¬ 
gated, in their simple form, as well as in their hierarchical form Jordan and Jacobs (1994) 
(e.g Section 5.12 of McLachlan and Peel. (2000)) for regression and model-based cluster 
and discriminant analyses and in different application domains. A complete review of 
the MoE models can be found in Yuksel et al. (2012). For continuous data, which we 
consider here in the context of non-linear regression and model-based cluster analysis, 
MoE usually use normal experts, that is, expert components following the Gaussian dis¬ 
tribution. Along this paper. We will call it the normal mixture of experts, abbreviated 
as NMoE. However, it is well-known that the normal distribution is sensitive to outliers. 
Moreover, for a set of data containing a group or groups of observations with heavy tails 
or asymmetric behavior, the use of normal experts may be unsuitable and can unduly 
affect the £t of the MoE model. In this paper, we attempt to overcome these limitations 
in MoE by proposing more adapted and robust mixture of experts models which can deal 
with possibly skewed, heavy-tailed and atypical data. 

Recently, the problem of sensitivity of NMoE to outliers have been considered by 
Nguyen and McLachlan (2014) where the authors proposed a Laplace mixture of lin¬ 
ear experts (LMoLE) for a robust modeling of non-linear regression data. The model 
parameters are estimated by maximizing the observed-data likelihood via a minorization- 
maximization (MM) algorithm. Here, we propose alternative MoE models, by relaying 
on other non-normal distributions that generalize the normal distribution, that is, the 
skew-normal, t, and the skew-t distributions. We call these proposed NNMoE models, 
respectively, the skew-normal MoE (SNMoE), the t MoE (TMoE), and the skew-f MoE 
(STMoE). Indeed, in these last years, the use of the skew normal distribution, hrstly 
proposed by Azzalini (1985, 1986), has been shown beneficial in dealing with asymmetric 
data in various theoretic and applied problems. This has been studied in the hnite mixture 
literature by namely Lin et al. (2007b) for modeling asymmetric univariate data with the 
univariate skew-normal mixture. On the other hand, the t distribution provides a natural 
robust extension of the normal distribution to model data with possible outliers. This has 
been integrated to develop the t mixture model proposed by Mclachlan and Peel (1998) 
for robust cluster analysis of multivariate data. Recently, Bai et al. (2012) proposed a 
robust mixture modeling in the regression context on univariate data, by using a univari¬ 
ate f-mixture model. Moreover, in many practical problems, the robustness of t mixtures 
may however be not sufficient in the presence of asymmetric observations. To deal with 
this issue, Lin et al. (2007a) proposed the univariate skew-t mixture model which allows 
for accommodation of both skewness and thick tails in the data, by relying on the skew-f 
distribution, introduced by Azzalini and Capitanio (2003). For the general multivari¬ 
ate case using t, skew-normal and skew-f mixtures, one can refer to Mclachlan and Peel 
(1998); Peel and Mclachlan (2000), Pyne et al. (2009), (Lin, 2010), Lee and McLachlan 
(2013b), Lee and McLachlan (2013a), Lee and McLachlan (2014), and recently, the unify- 
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ing framework for previous restricted and unrestricted skew-t mixtures, using the CFUST 
distribution Lee and McLachlan (2015). The inference in the previously described ap¬ 
proaches is performed by maximum likelihood estimation via expectation-maximization 
(EM) or extensions (Dempster et ah, 1977; McLachlan and Krishnan, 2008), in particu¬ 
lar the expectation conditional maximization (ECM) algorithm (Meng and Rubin, 1993). 
For the Bayesian framework, Fruhwirth-Schnatter and Pyne (2010) have considered the 
Bayesian inference for both the univariate and the multivariate skew-normal and skew-t 
mixtures. For the regression context, the robust modeling of regression data has been 
studied namely by Wei (2012) who considered a t-mixture model for regression analy¬ 
sis of univariate data, as well as by Bai et al. (2012) who relied on the M-estimate in 
mixture of linear regressions. In the same context of regression. Song et al. (2014) pro¬ 
posed the mixture of Laplace regressions, which has been then extended by Nguyen and 
McLachlan (2014) to the case of mixture of experts, by introducing the Laplace mixture 
of linear experts (LMoLE). Recently, Zeller et al. (2015) introduced the scale mixtures of 
skew-normal distributions for robust mixture regressions. However, unlike our proposed 
NNMoE models, the regression mixture models of Wei (2012), Bai et al. (2012), Song 
et al. (2014), Zeller et al. (2015) do not consider conditional mixing proportions, that 
is, mixing proportions depending on some input variables, as in the case of mixture of 
experts, which we investigate here. In addition, the approaches of Wei (2012), Bai et al. 
(2012) and Song et al. (2014) do not consider both the problem of robustness to outliers 
and the one to deal with possibly asymmetric data. Indeed, here we consider the mixture 
of experts framework for non-linear regression problems and model-based clustering of 
regression data, and we attempt to overcome the limitations of the NMoE model for deal¬ 
ing with asymmetric, heavy-tailed data and which may contain outliers. We investigate 
the use of the skew-normal, t and skew t distributions for the experts, rather than the 
commonly used normal distribution. First, the skew-normal mixture of experts (SNMoE) 
is proposed to accommodate data with possible asymmetric behavior. For heavy tailed 
or possibly noisy data, that is, data with atypical observations, we hrst propose the t- 
mixture of experts model (TMoE) to handle the issues regarding namely the sensitivity of 
the NMoE to outliers. Finally, we propose the skew-t mixture of experts model (STMoE) 
which allows for accommodation of both skewness and heavy tails in the data and which 
is also robust to outliers. These models correspond to extensions of the unconditional 
mixture of skew-normal (Lin et ah, 2007b), t (Mclachlan and Peel, 1998; Wei, 2012), and 
skew t (Lin et ah, 2007a) models, to the mixture of experts (MoE) framework, where the 
mixture means are regression functions and the mixing proportions are covariate-varying. 
For the models inference, we develop dedicated expectation-maximization (EM) and ex¬ 
pectation conditional maximization (ECM) algorithms to estimate the parameters of the 
proposed models by monotonically maximizing the observed data log-likelihood. The EM 
algorithms are indeed very popular and successful estimation algorithms for mixture mod¬ 
els in general and for mixture of experts in particular. Moreover, the EM algorithm for 
MoE has been shown by Ng and McLachlan (2004) to be monotonically maximizing the 
MoE likelihood. The authors have showed that the EM (with IRLS in this case) algorithm 
has stable convergence and the log-likelihood is monotonically increasing when a learning 
rate smaller than one is adopted for the IRLS procedure within the M-step of the EM 
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algorithm. They have further proposed an expectation conditional maximization (ECM) 
algorithm to train MoE, which also has desirable numerical properties. The MoE has also 
been considered in the Bayesian framework, for example one can cite the Bayesian MoE 
Waterhouse et ah (1996); Waterhouse (1997) and the Bayesian hierarchical MoE Bishop 
and Svensen (2003). Beyond the Bayesian parametric framework, the MoE models have 
also been investigated within the Bayesian non-parametric framework. We cite for ex¬ 
ample the Bayesian non-parametric MoE model (Rasmussen and Ghahramani, 2001) and 
the Bayesian non-parametric hierarchical MoE approach of J. Q. Shi and Titterington 
(2005) using Gaussian Processes experts for regression. For further models on mixture 
of experts for regression, the reader can be referred to for example the book of Shi and 
Ghoi (2011). In this paper, we investigate semi-parametric models under the maximum 
likelihood estimation framework. 

The remainder of this paper is organized as follows. In Section 2 we briefly recall 
the MoE framework, the NMoE model and its maximum-likelihood estimation via EM. 
In Section 3, we present the SNMoE model and in Section 4 we present its inference 
technique using the EGM algorithm. Then, in Section 5 we present the TMoE model and 
derive its parameter estimation technique using the EM algorithm in Section 6. Then, 
in Section 7, we present the STMoE model and in Section 8 the parameter estimation 
technique using the EGM algorithm. In Section 11, we also show how the model selection 
can be performed for these NNMoE models. We then investigate in Section 9 the use of 
the proposed models for fitting non-linear regression functions as well for prediction on 
future data. We also show in Section 10 how the models can be used in a model-based 
clustering prospective. In Section 12, we perform experiments to assess the proposed 
models. Finally, in Section 13, conclusions are drawn and a future work 


2 Mixture of experts for continuous data 

Mixture of experts (Jacobs et ah, 1991; Jordan and Jacobs, 1994) are used in a variety 
of contexts including regression, classification and clustering. Here we consider the MoE 
framework for fitting (non-linear) regression functions and clustering of univariate con¬ 
tinuous data . The aim of regression is to explore the relationship of an observed random 
variable V given a covariate vector X G via conditional density functions for Y\X = x 
of the form f[y\x), rather than only exploring the unconditional distribution of Y. For 
their reach modeling flexibility, mixture models (McLachlan and Peel., 2000) has took 
much attention for non-linear regression problems and we distinguish in particular mix¬ 
ture of regressions and mixture of experts for regression analysis. The univariate mixture 
of regressions model assumes that the observed pairs of data {x, y) where y G M is the 
response for some covariate x ^ W, are generated from K regression functions and are 
governed by a hidden categorical random variable Z indicating from which component 
each observation is generated. Thus, the mixture of regressions model decomposes the 
nonlinear regression model density f{y\x) into a convex weighted sum of K regression 
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component models fk{y\x) and can be defined as follows: 

K 

f{y\x-,^) = '^7rkfk{y\x-,^k) (1) 

k=l 

where the tt^’s are defined by = ¥(Z = k) and represent the non-negative mixing pro¬ 
portions that sum to 1. The model parameter vector is given by iP' = (tti, ..., vr^-i, • • •, 
^k being the parameter vector of the kth component density. 

Although similar, the mixture of experts (Jacobs et ah, 1991) differ from regression 
mixture models in many aspects. One of the main differences is that the MoE model 
consists in a fully conditional mixture while in the regression mixture, only the component 
densities are conditional. Indeed, the mixing proportions are constant for the regression 
mixture, while in the MoE, they are modeled as a function of the inputs, generally modeled 
by logistic or a softmax function. 


2.1 The mixture of experts (MoE) model 

Mixture of experts (MoE) for regression analysis (Jacobs et ah, 1991; Jordan and Jacobs, 
1994) extend the model (1) by modeling the mixing proportions as function of some 
covariates r G M'^. The mixing proportions, known as the gating functions in the context 
of MoE, are modeled by the multinomial logistic model and are defined by: 


P(Z = fc|r; a) 


exp jcxlr) 
E£iexp(Q:fr) 
7ik{r; (x) 


( 2 ) 


where r G is a covariate vector, ctk is the g-dimensional coefficients vector associated 
with r and o: = (ctf ,..., cx^_iY' is the parameter vector of the logistic model, with cx.k 
being the null vector. Thus, the MoE model consists in a fully conditional mixture model 
where both the mixing proportions (the gating functions) and the component densities 
(the experts) are conditional on some covariate variables (respectively r and x). The use 
of mixtures with mixing proportions defined through a logistic regression model has also 
been studied by Huang et ah (2015) for penalized model-based clustering of spatial data 
by using a mixture of offset-normal shape factor analyzers (MOSEA). 


2.2 The normal mixture of experts (NMoE) model and its max¬ 
imum likelihood estimation 

In the case of mixture of experts for regression, it is usually assumed that the experts are 
normal, that is, follow a normal distribution. A iP-component normal mixture of experts 
(NMoE) [K > 1) has the following formulation: 

K 

f{y\r,x-^) = '^TTk{r-OL)¥i{y-y{x-(3i,),(Tl) (3) 

k=l 
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which involves, in the semi-parametric case, component means defined as parametric 
(non-)hnear regression functions 

The NMoE model parameters are estimated by maximizing the observed data log- 
likelihood by using the EM algorithm (Dempster et ah, 1977; Jacobs et ah, 1991; Jordan 
and Jacobs, 1994; Jordan and Xu, 1995; Ng and McLachlan, 2004; McLachlan and Kr- 
ishnan, 2008). Suppose we observe an i.i.d sample of n observations {yi,... ,yn) with 
their respective associated covariates (a;i,..., Xn) and (ri,..., Xr). Then under the MoE 
model, the observed data log-likelihood for the parameter vector ^ is given by: 

n K 

\ogL{^) = ^log^7rfc(ri;Q:)N {yi; y{x; (3,,), al) . (4) 

i=l k=l 


The E-Step at the mth iteration of the EM algorithm for the NMoE model requires the 
calculation of the following posterior probability that the observation {yi,Xi,ri) belongs 
to expert k, given a parameter estimation 


ik 




Trk{r;OL'^ 




f{yu 'P'"'*: 


(5) 


Then, the M-step calculates the parameter update by maximizing the well-known 

Q-function, that is the expected complete-data log-likelihood: 




( 6 ) 


where 17 is the parameter space. For example, in the case of normal mixture of linear 
experts (NMoLE) where each expert’s mean has the flowing linear form: 

f^k) = (7) 


where G MP is the vector of regression coefficients of component k, the updates for each 
of the expert component parameters consist in analytically solving a weighted Gaussian 
linear regression problem and are given by: 




2im+l) 

^k 


X 

2=1 


(m) T 
‘ik 


1 -1 




i=l 






X 


V" r 


(m) 


ik 


( 8 ) 

( 9 ) 


For the mixing proportions, the parameter update cannot however be obtained in 

a closed form. It is calculated by Iteratively Reweighted Least Squares (IRLS) (Jacobs 
et ah, 1991; Jordan and Jacobs, 1994; Chen et ah, 1999; Green, 1984; Chamroukhi et ah, 
2009a; Chamroukhi, 2010). 
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However, the normal distribution is not adapted to deal with asymmetric and heavy 
tailed data. It is also known that the normal distribution is sensitive to outliers. In the 
proposal, we hrst propose to address the issue regarding the skewness, by proposing the 
skew-normal mixture of experts (SNMoE). Then, we propose a robust htting of the MoE, 
which is adapted to heavy-tailed data, by using the t distribution, that is, the t mixture 
of experts (TMoE). Finally, the proposed skew-f mixture of experts (STMoE) allows for 
simultaneously accommodating asymmetry and heavy tails in the data and is also robust 
to outliers. 


3 The skew-normal mixture of experts (SNMoE) model 

The skew-normal mixture of experts (SNMoE) model uses the skew-normal distribution 
as density for the expert components. We hrst recal the skew-normal distribution and 
describe its stochastic and hierarchical presentation, to then integrate them into the pro¬ 
posed SNMoE model. 

3.1 The skew-normal distribution 

As introduced by (Azzalini, 1985, 1986), a random variable Y follows a univariate skew- 
normal distribution with location parameter p G M, scale parameter G (0, cxd) and 
skewness parameter A G M if it has the density 

( 10 ) 

a a \ ^ J 

where 0(.) and $(.) denote, respectively, the probability density function (pdf) and the 
cumulative distribution function (cdf) of the standard normal distribution. It can be seen 
from (10) that when A = 0, the skew-normal reduces to the normal distribution. As 
presented by Azzalini (1986); Henze (1986), if 

Y = ^ + 5\U\ + ^l-5'^E (11) 

where 5 = U and E are independent random variables following the normal distri¬ 

bution N(0, cr^), then Y follows the skew-normal distribution with pdf SN(/i, A) given 
by (10). In the above, \U\ denotes the magnitude of U. This stochastic representation 
of the skew-normal distribution leads to the following hierarchical representation in an 
incomplete data framework, as presented in Lin et ah (2007b): 

Y\u ~ N(/i + 5|M|,(l-(52)a2), 

U ~ N(0,a2). ^ 

This hierarchical representation greatly facilitates the inference for the model, namely 
in the skew-normal mixture model. Introduced by Lin et al. (2007b), a iL-component 
skew-normal mixture model is given by: 

K 

SN(|/;/ifc,afc, Afc) 

k=l 
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where the mixture components have a skew-normal density SN(.;.,.,.) given by (10). 
For the skew-normal mixture, the mixing proportions and the means of the mixture 
components are assumed to be constant. 

In the following section, we present the skew-normal mixture of experts (SNMoE) 
which extends the skew-normal mixture model to the case of mixture of experts framework, 
by considering conditional distributions for both the mixing proportions and the means 
of the mixture components. 

3.2 The skew-normal mixture of experts (SNMoE) 

The proposed skew-normal MoE (SNMoE) is a iF-component MoE model with skew- 
normal experts. It is dehned as follows. Let SN(/i,cr^,A) denotes a skew-normal dis¬ 
tribution with location parameter p, scale parameter a and skewness parameter A. A 
A-component SNMoE is then dehned by: 

K 

f{y\r,X]^) = ^7rfc(r;Q:)SN(|/;p(a;;/3fc),afc, Afc) . (14) 

k=l 

In the SNMoE model, each expert component k has indeed a skew-normal distribu¬ 
tion, whose density is dehned by (10). The parameter vector of the model is ^ = 
(af ,..., ..., with iPk = {/3k ^ ^ki parameter vector for the fcth 

skewed-normal expert component. It is obvious to see that if the skewness parameter 
Afc = 0 for each k, the SNMoE model (14) reduces to the NMoE model (3). Before 
going on the model inference, we hrst present its stochastic and hierarchical representa¬ 
tions, which will serve to derive the ECM algorithm for maximum likelihood parameter 
estimation. The SNMoE model is characterized as follows. 


3.2.1 Stochastic representation of the SNMoE 

By using the stochastic representation (11) of the skew-normal distribution, the stochastic 
representation for the skew-normal mixture of experts (SNMoE) is as follows. Let U and 
E be independent univariate random variables following the standard normal distribution 
N(0,1) with pdf 0(.). Given some covariates Xi and r*, a random variable Yi is said to 
follow the SNMoE model (14) if it has the following representation: 


Yi — ^{xi, f3^J SziCzilUil + '^i'^ziEi. 


(15) 


In (15), we have 5z^ = 


A... 


where Zi G {1,..., A} is a realization of the categorical 


\A-^ 

variable which follows the multinomial distribution, that is: 


Zi\ri ~ Mult(l; 7ri(ri; a),..., TiK{ru a)) (16) 

where each of the probabilities 7r^.(rj; a) = P(Zj = Zi\ri) is given by the logistic function 
(2). In this incomplete data framework, Zi represents the hidden label of the component 
generating the Ah observation. 
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The stochastic representation (15) of the SNMoE leads to the following hierarchical 
representation, which, as it will be presented in Section 4, greatly facilitates the model 
inference. 

3.2.2 Hierarchical representation of the SNMoE 

By introdncing the binary latent component-indicators Zi^ snch that Zi^ = 1 iff Zj = /c, 

Zi being the hidden class label of the zth observation, a hierarchical model for the SNMoE 
model can be derived from its stochastic representation (15) and is as follows 

T)litj, Zijf 1, Xi ~ N ^^(iCj, /3^) -|- 5k\ui\i (1 ^ 

Ui\Zik = l ~ N(0,(Tfc), (17) 

Zi\ri ~ Mnlt(l;7ri(ri;Q:),...,7r^(ri;a)) 

where Zj = (Z^, • • •, Zix) and 4 = 

V 

4 Maximum likelihood estimation of the SNMoE model 

The nnknown parameter vector ^ of the SNMoE model can be estimated by maximiz¬ 
ing the observed-data log-likelihood. Given an observed i.i.d sample of n observations 
(l/i,..., Hn) with their respective associated covariates (a^i,..., x^) and (ri,..., 03^), nn- 
der the SNMoE model (14), the observed data log-likelihood for the parameter vector ^ 
is given by: 

n K 

log L{^) = ^ log ^ -Kkivi- a)SN {y-y{x- /3fc), A^) . (18) 

i=l k=l 

The maximization of this log-likelihood can not be performed in a closed form. How¬ 
ever, in this latent data framework, the maximization can be performed via expectation- 
maximization (EM)-type algorithms (McLachlan and Krishnan, 2008). More specihcally, 
we propose a dedicated Expectation Conditional Maximization (ECM) algorithm to mono- 
tonically maximize (18). The ECM algorithm (Meng and Rnbin, 1993) is an EM variant 
that mainly aims at addressing the optimization problem in the M-step of the EM algo¬ 
rithm. In ECM, the M-step is performed by several conditional maximization (CM) steps 
by dividing the parameter space into snb-spaces. The parameter vector npdates are then 
performed seqnentially, one coordinate block after another in each snb-space. 

4.1 ECM-algorithm for the SNMoE model 

Deriving the ECM algorithm reqnires the dehnition of the complete-data log-likelihood. 
From the hierarchical representation (17) of the SNMoE, the complete-data log-likelihood 
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»P', where the complete-data are {yi, Zi,Ui,Xi,ri}'!^^i, is given by: 


n K 


logL.i^) = EE ^ifc[log (P(^i = k\ri)) + log {f {ui\Zik = 1)) + log(/ {yi\ui,Zik = 1,®*))] 

i=l k=l 

K 

= log Lc{a)+ '^logLci^k), (19) 


k=l 


with 
log Lc{a) 

logLci^k) 


n K 

'^'^Zik log 7rk{ri;a), 

i=\ k=l 


i=l 


log(27r) - log{al) 


\^og{l-5l) 


dj^k dk dik Ui 

2(1 - dl) (1 - 5l)ak 


u~ 


2(1 - dl) 




where dik = denotes the Mahalanobis distance between yi and the fcth expert’s 

mean (with ak as the standard deviation). Then, the proposed ECM algorithm for the 
SNMoE model performs as follows. It starts with an initial parameter vector and 
alternates between the E- and CM- steps until a convergence criterion is satished. 


4.1.1 E-Step 

The E-Step of the ECM algorithm for the SNMoE calculates the Q-function, that is the 
conditional expectation of the complete-data log-likelihood (19), given the observed data 
{(l/j, ajj, and a current parameter estimation m being the current iteration; 


^(a/ia/W) = E[logL,(a')|{9..a:..r.}“.,;a/'™)]. 

From (19), it follows that the Q-function is given by: 

K 

Q{^- = Qi(a; Q,{^kl 

k=l 


with 


Qi(a;a^(-)) 


n K 


log7rfc(r,;a). 


i=l k=l 


E 

2 = 1 


r. 


(m) 

ik 


log(27r) - log(a^) - - log(l - 6l) 


dk dik 

' {1 - 6l)ak ~ 2{l - dlWk ‘2{1-Sl) 


^i'm) 

^2,ik 


rP 


( 20 ) 


( 21 ) 


( 22 ) 


(23) 
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for /c = 1,..., where the required conditional expectations are given by: 


'^ik — lEr^j>.{m) ll/j, iCj, T’j] , 

[Ui\Zij^ 1, l/j, iCj, Fj] , 

^2,ik ~ lE^(m) \Uj^ \Zik ~ 1) 2/i) • 


The ’s represent the posterior distribution of the hidden class labels Zj and correspond 

to the posterior memberships of the observed data. They are given by: 


T. 


(m) 

ik 


TTfcfr; OL^ 






(24) 


The conditional expectations and correspond to the posterior distribution of the 

hidden variables Ui and Uf, respectively. From the hierarchical representation (17), as 
shown by Lin et al. (2007b) in the case of the skew-normal mixture model, by Bayes’ 
theorem, the posterior distribution of Ui is the following half normal: 


Ui\Zii^ 1, t/i, Xi, Vi ~ TTA^jOjCxd) (hitifci ^uk) 


where the posterior mean and variance in this case of SNMoE are respectively given by: 

= SkiVi - fi{xi] (3k)) and = (1 - 5l)al. 

Then the two conditional expectations of Ui and Uf are respectively given by: 



From (21), (22), and (23), it can be seen that the Q-function is calculated by analytically 
calculating the conditional expectations (24), (25) and (26). 


4.1.2 M-Step 

Then, the M-step calculates the parameter vector as in (6), that is by maximizing 

the Q-function (21) with respect to This can be performed by separately maximizing 
with respect to ol and, for each component A: (A: = 1,..., iF), the function 
Q{^k', with respect to ^k where = {(3^, A^)^. We adopt the ECM extension 

of the EM algorithm. The M-step in this case consists of four conditional- maximization 
(CM)-steps, corresponding to the decomposition of the parameter vector ^ into four 
sub-vectors ^ = {cx,i3, cr, A)^. Thus, this leads to the following CM steps. 
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( 27 ) 


CM-Step 1 Calculate by maximizing Qi(q:; 

Q,(m+i) _ argmaxQi(a;!^'(”*)). 

Ot 

Contrarily to the case of the standard skew-normal mixture model and skew-normal re¬ 
gression mixture model, this maximization in the case of the proposed SNMoE does not 
exist in closed form. It is performed iteratively by Iteratively Reweighted Least Squares 
(IRLS). 


The Iteratively Reweighted Least Squares (IRLS) algorithm: The IRLS algo¬ 
rithm is used to maximize Qi{q., given by (22) with respect to the parameter a in 

the M step at each iteration m of the ECM algorithm. The IRLS is a Newton-Raphson al¬ 
gorithm, which consists in starting with a vector and, at the /-|-1 iteration, updating 
the estimation of ol as follows: 





-1 

dcxdoL^ 

q ;= q :(0 dcX 


y=Q'(0 


(28) 


where ^ respectively the Hessian matrix and the gradient 

vector of Qi(q:, ^(”^)). At each IRLS iteration the Hessian and the gradient are evaluated 
at a = and are computed similarly as in Chamroukhi et ah (2010)Chamroukhi et ah 
(2009b). The parameter update is taken at convergence of the IRLS algorithm 

(28). Then, for fc = 1..., JL, 


CM-Step 2 Calculate by maximizing Q 2 {^k', 1?'*'™'^) given by (23) w.r.t 13^. Here 

we focus on the common linear case for the experts where each expert-component mean 
function is the one of a linear regression model and has the form (7). It can be easily 
shown that the maximization problem for the resulting skew-normal mixture of linear of 
experts (SNMoLE) can be solved analytically and has the following solution: 


Mm+l) 

Pk 


E 

i=l 


(m) J 


E 

2 = 1 


(g) 


'^ik 


V'. 


dm). 


— Si 'e 


(m) 

l,2fc 


Xr. 


(29) 


CM-Step 3: Calculate by maximizing given by (23) w.r.t cr^. 

Similarly to the update of the analytic solution of this problem is given by: 




E 


2=1 '^ik 


(m) 


Vi- 0. 




X, 


- - /37’”^"d.) + eg 


(30) 


2 1 




(m)\ 




n (m) 
2=1 ' ik 
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CM-Step 4 Calculate by maximizing given by (23) w.r.t Afc, with 

(3j^ and al fixed at and respectively. This consists in solving the following 

equation for 6k to obtain (A; = 1,..., K) as the solution of: 


2(m+l),- i-2\ \ ^ (m) , /i , r2\ \ ^ ("*+!) 

i=l i=l 

(m) 






r. 


(m) 


ik 


2=1 


''2^ik 


+ h/i - /3 


'7’(m+l) 






= 0 - 


(31) 


Then, given the update the update of the skewness parameter is calculated as 

■V (m+l) ^ 

It is obvious to see that when the skewness parameter A^ = = 0 for all k, the 

parameter updates for the SNMoE corresponds to those of the NMoE. Hence, compared 
to the standard NMoE, the SNMoE model is characterized by an additional flexibility 
feature, that is the one to be handle possibly skewed data. However, while the SNMoE 
model is tailored to model the skewness in the data, it may be not adapted to handle 
data containing groups or a group with heavy-tailed distribution. The NMoE and the 
SNMoE may thus be affected by outliers. In the next section, we address the problem of 
sensitivity of normal mixture of experts to outliers and heavy tails. We first propose a 
robust mixture of experts modeling by using the t distribution. 


5 The t mixture of experts (TMoE) model 

The proposed t mixture of experts (TMoE) model is based on the t distribution, which is 
known as a robust generalization of the normal distribution. The t distribution is recalled 
in the following section. We also described its stochastic and hierarchical representations, 
which will be used to derive those of the TMoE model. 


5.1 The t distribution 


The use of the t distribution for mixture components has been shown to be more robust 
than the normal distribution to handle outliers in the data and accommodate data with 
heavy tailed distribution. This has been shown in terms of density modeling and cluster 
analysis for multivariate data (Mclachlan and Peel, 1998; Peel and Mclachlan, 2000) as 
well as for univariate data (Lin et ah, 2007a). The t-distribution with location parameter 
p G M, scale parameter G (0, cxo) and degrees of freedom v G (0, oc) has the probability 
density function 






(32) 


where dy = denotes the Mahalanobis distance between y and jj, (a being the scale 
parameter), and P is the gamma function given by P(x) = dx. The t dis- 
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tribution can be characterized as follows. Let E be an univariate random variable 
with a standard normal distribution with pdf given by 0(.). Then, let W he & ran¬ 
dom variable independent of E and following the gamma distribution, that is hh ~ 
gamma(|, |) where the density function of the gamma distribution is given by f{x; a, b) = 
{fe“a;““^/r(a)} exp(—6a:)l(o,oo)(a^); (a, &) > 0 and the indicator function l(o,oo)(a^) = 1 for 

a: > 0 and is zero elsewhere. Then, a random variable Y having the following representa¬ 
tion: 

y = /i (T^= (33) 

follows the t distribution u) with pdf given by (32). As given in Liu and Rubin 

(1995) for the multivariate case, a hierarchical representation of the t distribution in this 
univariate case can be expressed from the stochastic representation (33) as: 


yi\wi 

W,: 


gamma (f, |)- 


(34) 


5.2 The t mixture of experts (TMoE) model 

The proposed t mixture of experts (TMoE) model extends the t mixture model to the 
MoE framework. The mixture of t distributions have been Erst proposed by Mclachlan 
and Peel (1998); Peel and Mclachlan (2000) for multivariate data. For the univariate case, 
a iP-component t mixture model takes the following form 

K 

A,(|/;/ifc,cTfc,z/fc) (35) 

k=l 

where each of the mixture components has a t density given by (32). Wei (2012) considered 
the t-mixture model for the regression context on univariate data where the means in 
(35) are (linear) regression functions of the form However, this model do not 

explicitly model the mixing proportions as function the inputs; they are assumed to be 
constant. 

The proposed t mixture of experts (TMoE) is MoE model with t-distributed experts 
and is dehned as follows. Let ty(/i, cx^, u) denotes a t distribution with location parameter 
/i, scale parameter a and degrees of freedom z/, whose density is given by (32). A K- 
component TMoE model is then defined by: 

K 

f{y\r,x]^) = '^Tik{,r]cx) E^{y]y{x]f3k),al,Vk) ■ (36) 

k=l 

The parameter vector of the TMoE model is given by iP' = (ctf,..., iP'f,..., 

where = [Yk^al, i/kY is the parameter vector for the kth t expert component which 
has a t distribution. One can see that when the robustness parameter —)■ cx) for each 

fc, the TMoE model (36) approaches the NMoE model (3). 

In the following section, we present the stochastic and hierarchical characterizations 
of the proposed TMoE model and then derive the model maximum likelihood inference 
procedure. 
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5.2.1 Stochastic representation of the TMoE 

By using the stochastic representation (33) of the t distribution, the stochastic represen¬ 
tation for the t mixture of experts (TMoE) is as follows. Let E be a univariate random 
variable following the standard normal distribution E ~ Suppose that, conditional 
on the hidden variable Zi = Zi, a. random variable Wi is distributed as gamma(^, 
Then, given the covariates {xi, rj), a random variable T) is said to follow the TMoE model 
(36) if it has the following representation; 


Yi = ii(Xi,l3,J + a„ 



(37) 


where the categorical variable Zi conditional on the covariate r* follows the multinomial 
distribution as in (16). 

Similarly to the case of the previously presented SNMoE model, the stochastic representa¬ 
tion (37) leads to the following hierarchical representation of the TMoE, which facilitates 
the model inference as it will be presented in Section 6. 


5.2.2 Hierarchical representation of the TMoE 

From (33) and (37), following the hierarchical representation of the mixture of multivariate 
t-distributions (see for example Mclachlan and Peel (1998)), the hierarchical representa¬ 
tion of the TMoE model is written as: 

Yi\wi,Zik = l,Xi ~ N (n{Xi;/3f.),^\ 

V WiJ 

Wi\Zik = l ~ gamma (38) 

Zi\ri ~ Mult (l;7ri(ri; ck), ... ,7rK(n; a)) • 


6 Maximum likelihood estimation of the TMoE model 

Given an i.i.d sample of n observations, the unknown parameter vector l?' can be estimated 
by maximizing the observed-data log-likelihood, which, under the TMoE model, is given 
by: 

n K 

\ogL{^) = '^\og^Tik{,ri]Cx)tVk (2/;/i(a:;/3fc),(Tfc,z/fc) . (39) 

i=l k=l 

To perform this maximization, we hrst use the EM algorithm and then described an 
ECM extension (Meng and Rubin, 1993) as in Liu and Rubin (1995) for a single t dis¬ 
tribution and as in Mclachlan and Peel (1998); Peel and Mclachlan (2000) for mixture of 
f-distributions. 
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6.1 The EM algorithm for the TMoE model 

To maximize the log-likelihood function (39), the EM algorithm for the TMoE model 
starts with an initial parameter vector and alternates between the E- and M- steps 
until convergence. The E-step computes the expected completed data log-likelihood (the 
Q-function) and the M-Step maximize it. From the hierarchical representation of the 
TMoE (39), the complete data consist of the responses {yi,, yn) and their corresponding 
covariates (a^i,..., Xn) and (ri,..., r„), as well as the latent variables (tci,..., Wn) and 
the latent labels (zi,..., Zn)- Thus, the complete-data log-likelihood of '1' is given by: 


n K 


logLc(»?^) = EE ^ifc[log (P(^j = k\ri)) + \og{f {wi\Zik = 1)) + \og{f {yi\ui,Zik = I,®*))] 

1=1 k=l 

K 

= log Lic(a) ^ [ log L 2 c{^k) + log L^ci^k )], (40) 


k=l 


where 

logLic(Q;) 

logLic(‘Z^fc) 
log Lzc{vk) 


n K 


EE Zik log TTkiri^cx), 


z=l k=l 


^ r 1 1 1 , 

^ ^ - log( 27 r) - - log{al) - -Wid: 

i=l 


i=l 




Vk 


T^k 


Vk 


(41) 

(42) 



Wi 


(43) 


6.2 E-Step 

The E-Step of the EM algorithm for the TMoE calculates the Q-function, that is the 
conditional expectation of the complete-data log-likelihood (60), given the observed data 
and a current parameter estimation m being the cnrrent iteration. It can be seen 

from (41), (42) and (43) that computing the Q-function requires the following conditional 
expectations: 


'T'ik — lEr^(m) [Zik\yii Xi, Tij , 

[Wi\yi, Zik = I, Xi, ri] , 

= E^im) [log{Wi)\yi, Zik = 1, Xi, r*] ■ 

It follows that the Q-fnnction is given by 


K 




(44) 


k=l 
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where 


n K 

= '^'^T^J^hogTTkiri;oL 
i=l k=l 
n . 

= E4'”’[-5M2ir)-hog(4)-^4”’4 

i=l 
n 


‘ik 


'ik 


2=1 


1 , f^k\, (Vk\ (Vk\ (m) , (Vk 4 (m)' 

-logr ( -) + (-) log (-) - {-) u.', ' + (y - 1) e'J 


The required conditional expectations are given as follows. First, the conditional expec¬ 
tation (44) corresponds the posterior membership probabilities and is given by: 




'^ik 






(45) 


Then, it can be easily shown (see for example Mclachlan and Peel (1998) and Peel and 
Mclachlan (2000) Liu and Rubin (1995) for details) that: 




(”2) , 1 

1^ (m) 

(m) ‘i'k ’ 


^k ' + 4 


\iog{Wi)\yi,Zik = l,®i,r-i] = log [w^Y'') + | 


+ l' 


-.og(^' 


(46) 


= 


where 'ijj{x) = {dT{x)/dx] /T{x) is the Digamma function. 


6.3 M-Step 

In the M-step, as it can be seen from (44), the Q-function can be maximized by indepen¬ 
dently maximizing Qi{cx.] '1' and, for each k, Q 2 {'^k] Qsiuk] »?'^”*^), with respect 

to a, and Uk, respectively. Thus, on the (m -|- l)th iteration of the M-step, the model 
parameters are updated as follows. 

M-Step 1 Calculate by maximizing w.r.t (x. This can be performed 

iteratively via IRLS (28) as for the mixture of SNMoE. 

M-Step 2 Calculate by maximizing w.r.t ^k = {01 This 

is achieved by hrst maximizing with respect to f3k and then with respect 

to al- For the t mixture of linear experts (TMoLE) case where the expert means are of 
the form (7), this maximization can be performed analytically and provides the following 
updates: 


^(gn+l) 


n 

[E 

i=l 


(m) (m) 


1 -1 


Rfc 


'w, 


ik 


XjX: 


E 

2 = 1 


4 


I?).,,,!™) 


ik 


^ik 


Vi^i 


(48) 
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2(m+l) 


1 


(49) 


(m) 

2-^i=l ^ ik 


E 

2=1 


(m) (m) 

Tik 



(m+1) 



2 


Here, we note that, following Kent et al. (1994) in the case of ML estimation for single 
component t distribution and Mclachlan and Peel (1998); Peel and Mclachlan (2000) for 
mixture of multivariate t distributions, the EM algorithm can be modihed slightly by 
replacing the divisor X]r=i (^9) t>y J2'i=i'''ik'^ik'^- The modihed algorithm may 
converge faster than the conventional EM algorithm. 


M-Step 3 Calculate by maximizing Q 3 (z/fc; w.r.t The degrees of free¬ 
dom update is therefore obtained by iteratively solving the following equation for 

Vk- 


(f) + (t 


+ 1 + 


Z _-/2 = l ' 1 . 


(m) 


ik 


n 

2 = 1 


(m)^ 
ik ) 


W. 


(m) 

ik 


+'iIj 




= 0 . 


(50) 


This scalar non-linear equation can be solved with a root Ending algorithm, such as Brent’s 
method (Brent, 1973). 

It is obvious to see that, as mentioned previously, if the number of degrees of freedom i/k 
is fixed at oo for all k, then the parameter updates for the TMoE model are exactly those 
of the NMoE model (since Wik tends to 1 in this case). The TMoE model constitutes 
therefore a robust generalization of the NMoE model that is able to model data with 
density heaving longer tails than those of the NMoE model. 

After deriving the EM algorithm for the TMoE parameter estimation, now we de¬ 
scribed and ECM extension. 


6.4 The ECM algorithm for the TMoE model 

Following the ECM extension of the EM algorithm for a single t distribution proposed by 
Liu and Rubin (1995) and the one of the EM algorithm for the f-mixture model (Mclachlan 
and Peel, 1998; Peel and Mclachlan, 2000), the EM algorithm for the TMoE model can 
also be modified to give an ECM version by adding an additional E-Step between the two 
M-steps 2 and 3. This additional E-step consists in taking the parameter vector ^ with 
^k = instead of that is 

Q2{i't; «^'”') = «*”■>, 4 ”’)- 

Thus, the M-Step 3 in the above is replaced by a Conditional-Maximization (CM)-Step in 
which the degrees of freedom update (50) is calculated with the conditional expectation 
(46) and (47) computed with the updated parameters respectively 

given by (48) and (49). 
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The SNMoE presented before allows to deal with asymmetric data. The TMoE handles 
the problem of heavy tailed data possibly affected by ontliers. Now, we propose the skew 
t mixture of experts (STMoE) model which attempts to simultaneously accommodate 
heavy tailed data with possible outliers and with asymmetric distribution. 


7 The skew t mixture of experts (STMoE) model 

The proposed skew t mixture of experts (STMoE) model is a MoE model in which the 
expert components have a skew-f density, rather than the standard normal one as in the 
NMoE model, or the previously presented skew-normal and t ones as the SNMoE and 
the TMoE, respectively. The skew-t distribution as well as its stochastic and hierarchical 
representations are recalled in the following section. 


7.1 The skew t distribution 

Let us denote by t^{.) and T^{.) respectively the probability density function (pdf) and the 
cumulative distribution function (cdf) of the t distribution with degrees of freedom u. The 
skew t distribution, introduced by Azzalini and Capitanio (2003), can be characterized as 
follows. Let U be an univariate random variable with a standard skew normal distribution 
U ~ SN(0,1,A) (which can be shortened as f/ ~ SN(A)) with pdf given by (10). Then, 
let W be an univariate random variable independent of U and following the gamma 
distribution, that is, W ~ gamma(|,|). A random variable Y having the following 
representation: 

Y = iji + (51) 

Vie 

follows the skew t distribution ST(/i, A, u) with location parameter p,, scale parameter 
(T, skewness parameter A and degrees of freedom u, whose density is dehned by: 


/(l/;p,a^ A,z/) 


2 

a 


tyiydy) 


Ty+l 



Z/+ 1 
h> -I- 


(52) 


where dy = From the hierarchical distribution of the skew-normal (12), a further 

hierarchical representation of the stochastic representation (51) of the skew t distribution 
is given by: 


Yi\ui,Wi 

Ui\wi 

W, 


N ( p -h 5\ui\, 


1-5^ 


-a 


Wi 


N(0,-), 

Wi 


gamma 


z/ 1 / 
2 ’ 2 


(53) 
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7.2 The skew t mixture of experts (STMoE) model 

The skew proposed t mixture of experts (STMoE) model extends the skew t mixture 
model, which was hrst introduced by Lin et ah (2007a), to the MoE framework. A K- 
component skew t mixture model is given by 

K 

(54) 

k=l 

where each of the mixture components is a skew t density given by (56). In the skew-f 
mixture model (54), the mixing proportions and the expert means are constant, that is, 
they are not function of the inputs. In the proposed STMoE, we consider skew-f expert 
components with regression mean functions, and covariate varying mixing proportions. A 
A-component mixture of skew t experts (STMoE) is therefore dehned by: 

K 

f{y\r,x;^) = ^ 7rfc(r; a) ST(|/;/i(a;;/3fc), Afc, z/^)- (55) 

k=l 

The parameter vector of the STMoE model is iP' = (ctf,..., iP'f,..., where 

^k = {/3l o'fc, Afc, UkY is the parameter vector for the kth. skew t expert component whose 
density is dehned by 

f{y\x;Yx;(3k),a^,X,Y = ^ Uidy^x)) T^+i dy{x) (56) 

where dy{x) = represents the Mahalanobis distance between y and y{x] (3^). 

It can be seen that, when the robustness parameter —)■ oo for each k, the STMoE 
model (55) reduces to the SNMoE model (14). On the other hand, if the skewness 
parameter A^ = 0 for each k, the STMoE model reduces to the TMoE model (36). 
Moreover, when uj. ^ oo and A^ = 0 for each fc, it approaches the standrad NMoE model 
(3). This therefore makes the STMoE hexible as it generalizes the previously described 
models to accommodate situations with asymmetry, heavy tails, and outliers. 

7.3 Stochastic representation of the STMoE model 

The skew t mixture of experts model is characterized as follows. Suppose that conditional 
on a categorical variable Zi G {1,..., A} representing the hidden label of the component 
generating the ith observation and following the multinomial distribution (16), a random 
variable has the following representation: 

Yi = fi{xi-,l3Y + (57) 

where Aj and Wi are independent univariate random variables with, respectively, a stan¬ 
dard skew normal distribution Aj ~ SN(A 2 j, and a Gamma distribution Wi ~ gamma(l^, ^), 
and Xi and r, are some given covariate variables. Then, the variable Yi is said to follow 
the skew t mixture of experts (STMoE) dehned by (55). 
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7.4 Hierarchical representation of the STMoE model 

From the hierarchical representation (54) of the skew t distribution, a hierarchical model 
for the proposed STMoE model (55) can be derived from its stochastic representation 
(57) and is as follows: 


Wi^ ^ik f 7 
Ui\Wi^ 1 

'Wi\Zik = 1 
Zi\ri 


N ( + Sk\ui\, 


-CT: 


Wi 


k y 


at 




gamma 


^k ^k 

T’ T 

Mult (l; 7ri(r,; a),..., a)). 


(58) 


This hierarchical representation will be used to derive the maximum likelihood estimation 
of the STMoE model parameters ^ by using the ECM algorithm. 


8 Maximum likelihood estimation of the STMoE model 

The unknown parameter vector ^ of the STMoE model is estimated by maximizing the 
following observed-data log-likelihood given an observed i.i.d sample of n observations yi 
and their corresponding covariates Xi and Uj: 

n K 

\ogL{^) = ^log^7rfc(ri;Q;)ST(|/;/i(a;i;/3fc),(Tfc, Afc,z/fc)- (59) 

i=l k=l 

We perform this iteratively by a dedicated ECM algorithm. The complete data consist of 
the observations {yi, , yn), their corresponding covariates {xi, ..., Xn) and (ri,..., r„), 
as well as the latent variables (ui,..., Un) and (tci,..., Wn) and the latent labels {zi, ..., z„). 
Then, from the hierarchical representation of the STMoE (59), the complete-data log- 
likelihood of ^ is given by: 


logL.i^) 


n K 




log (P {Zi = k\ri)) + log (/ {wi\Zik 


1 )) + 


log (/ {Ui\wi, Zik = 1)) -7 log (/ {yi\ui, Zik = 1, Xi)) 

K 

log Lic{a) ^ [ log L 2 c{^k) + log h 3 c(nfc)] 

k=l 


(60) 
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where '^k 
logLicick) 
log L 2 c{^k) 

logL3c(t'fc) 


= (/^r, Afc)^ and 

n K 


EE Zik log TTk{ri]oc), 


i=l k=l 


^ Zik - log(27r) - log(o-i) - - log(l - 5l) 


Wi d^f, ^ Wi Ui 6k dik 


2=1 

n 


2(1 - ^l) (1 - 


E^., -,ogr(f) + (^).ogm + (f).ogK)-(f)» 


Vk 


Vk 


k'k 


k'k 


i=l 


Wj uf 


8.1 The ECM algorithm for the STMoE model 

The ECM algorithm for the STMoE model starts with an initial parameter vector 
and alternates between the E- and CM- steps until convergence. 


8.2 E-Step 

The E-Step of the CEM algorithm for the STMoE calculates the Q-function, that is the 
conditional expectation of the complete-data log-likelihood (60), given the observed data 
and a current parameter estimation m being the current iteration. 

From (60), it can be seen that computing the Q-function requires the following conditional 
expectations: 

Tjfc ~ lEr^(m) \Zik\yii Xi^ Tj}^ , 

= E^(™) \Wi\yi, Zik = I, Xi, r,] , 

[lTjl7j||/j, Zik 1) Xi^ Uj], 

“ E^(m) \\ViUi \yi^Zik = , 

47fc = [log(lTj) \yi, Zik = 1, Xi, Vi] ■ 

The Q-function being given by: 


K 

Q(,P; (c,; ) + ^ (^^ ^ ^(m) ) ^ gs (^^ ^ ) J ^ 

k=l 

where 
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Following the expressions of these conditional expectations given namely in the case of 
the standard skew t mixture model (Lin et ah, 2007a), the conditional expectations for 
the case of the proposed STMoE model can be expressed similarly as: 


(m) 


W, 


(m) 

ik 


^,(r, a'"') ST(«i d”’. 4”’ 
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,("i) I j 2 (m) 
u “T tl-u 
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14 ^ 

'^+1 ) 
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where 


,(^)ij2 (^) ’ 


'k 


(62) 


(63) 



We note that, for (66), we adopted a one-step-late (OSL) approach to compute the 
conditional expectation Cg™] as described in Lee and McLachlan (2014), by setting the 
integral part in the expression of the corresponding conditional expectation given in (Lin 
et ah, 2007a) to zero, rather than using a Monte Carlo approximation. We also mention 
that, for the multivariate skew t mixture models, recently Lee and McLachlan (2015) 
presented a series-based truncation approach, which exploits an exact representation of 
this conditional expectation and which can also be used in place of (66). 


8.3 M-Step 

The M-step maximizes the Q-function (61) with respect to ^ and provides the parameter 
vector update From (61), it can be seen that the maximization of Q can be 

performed by separately maximizing Qi with respect to the parameters a of the mixing 
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proportions, and for each expert /c (/c = 1,..., K), Q 2 with respect to al)^ and Afc, 
and Qs with respect to The maximization of Q 2 and Q 3 is carried ont by conditional 
maximization (CM) steps by npdating (/3^, al) and then npdating (A, Uk) with the given 
npdated parameters. This leads to the following CM steps. On the (m + l)th iteration 
of the M-step, the STMoE model parameters are npdated as follows. 

CM-Step 1 Calcnlate the parameter maximizing the fnnction Qi(q:; given 

by (45) by nsing IRLS (28). Then, for /c = 1..., iC, 


CM-Step 2 Calcnlate maximizing Q 2 i^k; w.r.t (/3j, al)^. 

For the t mixtnre of linear experts (TMoLE) case, where the expert means are linear 
regressors, that is, of the form (7), this maximization can be performed in a closed form 
and provides the following npdates: 
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CM-Step 3 The skewness parameters A^ are npdated by maximizing Q 2 (^A;; 
w.r.t Afc, with (3k and al hxed at the npdate (3^^^^^'^ and ak^'^^^\ respectively. It can be 

easily shown that the maximization to obtain (/c = 1,..., K) consists in solving 

the following eqnation in 5k'- 
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CM-Step 4 Similarly, the degrees of freedom i/k are npdated by maximizing Qsii'k] 
w.r.t Uk with (3k and al hxed at (3]^~^^^ and a1^"^^^\ respectively. An npdate is 

calcnlated as solntion of the following eqnation in Uk- 

(m) ( Am) 

Z_-/ 2 =l ik I ^3,2/c 

E n {m 
2=1 '^ik 

The two scalar non-linear eqnations (69) and (70) can be solved similarly as in the TMoE 
model, that is with a root Ending algorithm, snch as Brent’s method (Brent, 1973). 

As mentioned before, one can see that, when the robnstness parameter Uk ^ 00 for 
all the components, the parameter npdates for the STMoE model correspond to those 



w, 


(m) 

ik 


= 0. 


(70) 
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of the SNMoE model. On the other hand, when the skewness parameters Xk = 0, the 
STMoE parameter updates correspond to those of the TMoE model. Finally, when both 
the degrees of freedom —)■ oo and the skewness Xk = 0, we obtain the parameter updates 
of the standard NMoE model. The STMoe therefore provides a more general framework 
for inferring flexible MoE models. 


9 Prediction using the NNMoE 

The goal in regression is to be able to make predictions for the response variable(s) given 
some new value of the predictor variable(s) on the basis of a model trained on a set of 
training data. In regression analysis using mixture of experts, the aim is therefore to 
predict the response y given new values of the predictors {x^r), on the basis of a MoE 
model characterized by a parameter vector ^ inferred from a set of training data, here, by 
maximum likelihood via EM. These predictions can be expressed in terms of the predictive 
distribution of y, which is obtained by substituting the maximum likelihood parameter ^ 
into (l)-(2) to give: 

K 

f{y\x,r;^) = '^7ik{r;6L)fk{y\x;^k)- 

k=l 

Using /, we might then predict y for a given set of x^s and r’s as the expected value 
under /, that is by calculating the prediction y = E^(F|r, x). We thus need to compute 
the expectation of the mixture of experts model. It is easy to show (see for example 
Section 1.2.4 in Fruhwirth-Schnatter (2006)) that the mean and the variance of a mixture 
of experts distribution of the form (9) are respectively given by 

K 

Y,^k{r-,ar;)^^{Y\Z = k,x), (71) 

k=l 
K 

Y,T^k{r;an)[{^^{Y\Z = k,x)f+N^{Y\Z = k,x)\ - [E^(y|r,®)]'(72) 
k=l 

where E,^(Y\Z = k,x) and Y^(Y\Z = k,x) are respectively the component-specihc 
(expert) means and variances. The mean and the variance for the MoE models described 
here are given as follows. 

NMoE For the NMoE model, the normal expert means and variances are respectively 

given by E^{Y\Z = k,x) = Pj^x and Y^{Y\Z = k,x) = Then, from (71) it follows 
that the mean of the NMoE is given by 


E^{Y\r,x) = 

Y^{Y\r,x) = 


K 

E^{Y\r,x) = '^7rk{r]dLn)^lx. (73) 

k=l 

Then, the expected value for each of the three proposed MoE models is given as follows. 
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SNMoE From the mean and the variance of the skew-normal distribution, which can 
be calculated as in Genton et ah (2001) (for the multivariate case), and which are given 
in Lemma 1 in Lin et ah (2007b) for this scalar case, the expert means and variances for 
the SNMoE model are calculated similarly and respectively given by: 


and 


where Sk 
given by: 


K^(Y\Z = fc, x) 



W^{Y\Z = k,x) 



(Ti. 




Then, from (71) it follows that the mean of the SNMoE model is 


K 


E^{Y\r,x) = ^Trk{r;a)(^p,^x + J-SkCTk 


k=l 


(74) 


TMoE For the TMoE model, by using the expressions of the mean and the variance 
of the t distribution, it follows that for the TMoE model, for > 1, the expert means 

-V r 

are given by E^(Y\Z = k^x) = (3f^x and, for Ok > 2, the expert variances are given by 
N^{Y\Z = k,x) = (j|. Then, from (71), the mean of the TMoE model is therefore 

given by: 


K ^ 

E^{Y\r,x) = ^7rfc(r;Q:)^fciC. 

k=l 


(75) 


STMoE The mean and the variance for a skew t random variable, for this scalar case, 
can be easily computed as in Section 4.2 in Azzalini and Capitanio (2003) for a non-zero 
location parameter. Thus, for the STMoE model, the expert means for hfc > 1, are given 
by 

E^{Y\Z = k,x) = ^Ix + dk Sk ^{h) 
and the expert variances for > 2 are given by 

Y^(,Y\Z = k,x)= - Si e(h)j 


where ^(T'fc) 
by: 



Then, following (71), the mean of the STMoE is thus given 


K 

E^{Y\r,x) = 'Y^TTk{r-,di)(^^^f.x + cFkh (76) 

k=l 


Finally, the variance for each MoE model is obtained by using (72) with the specihed 
expert mean and variance calculated in the above. 
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10 Model-based clustering using the NNMoE 

The MoE models can also be used for a model-based clustering perspective to provide 
a partition of the regression data into K clusters. Model-based clustering using the 
NNMoE consists in assuming that the observed data are generated from a 

K component mixture of, respectively, skew-normal, t or skew t experts, with parameter 
vector The mixture components can be interpreted as clusters and hence each cluster 
can be associated with a mixture component. The problem of clustering therefore becomes 
the one of estimating the MoE parameters which is performed here by using dedicated 
EM algorithms. Once the parameters are estimated, the provided posterior membership 
probabilities Tik represent a fuzzy partition of the data. These posterior memberships 
are given by, (24), (45), (62), for, respectively the SNMoE, the TMoE, and the STMoE. 
A hard partition of the data can then be obtained from the posterior memberships by 
applying the MAP rule, that is, by maximizing the posterior cluster probabilities to assign 
each observation to a cluster: 


K , 

Zj = arg max r^k 


where Zi represents the estimated cluster label for the ith observation. 


(77) 


11 Model selection for the NNMoE 

One of the issues in mixture model-based clustering is model selection. The problem of 
model selection for the NNMoE models presented here in their general forms, is equivalent 
to the one of choosing the optimal number of experts K, the degree p of the polynomial 
regression and the degree q for the logistic regression. The optimal value of {K, p, q) can be 
computed by using some model selection criteria such as the Akaike Information Criterion 
(AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC) (Schwarz, 1978) or the 
Integrated Classification Likelihood criterion (ICL) (Biernacki et ah, 2000), etc. The AIC 
and BIC are are penalized observed data log-likelihood criteria which can be dehned as 
functions to be maximized and are respectively given by: 

klC(K,p,q) = logL(>i-)-((L^, 

BlC(K,p,q) = logL(g)- 

The ICL criterion consists in a penalized complete-data log-likelihood and can be ex¬ 
pressed as follows: 

ICL(A>,g) = logL,(9) - 

In the above, logL(l^) and log Lc(?^) are respectively the incomplete (observed) data log- 
likelihood and the complete data log-likelihood, obtained at convergence of the E(C)M 
algorithm for the corresponding mixture of experts model and is the number of free 





model parameters. The number of free parameters is given by 77 ^ = K{p + q + 3)—q — l 
for the NMoE model, 77 ^ = K{p + q + A) — q — 1 ioi both the SNMoE and the TMoE 
models, and = K{p + q +5) — q — 1 for the STMoE model. 

However, note that in MoE it is common to use mixing proportions modeled as logistic 
transformation of linear functions of the covariates, that is the covariate vector in (2) is 
given by r* = (l,rj)^ (corresponding to g = 2), r* being an univariate covariate variable. 
This is also adopted in this work. Moreover, for the case of linear experts, that is when 
the experts are linear regressors with parameter vector for which the corresponding 
covariate vector Xi in (7) is given by Xi = (1,0;*)^ (corresponding to p = 2), r* being 
an univariate covariate variable, the model selection reduces to choosing the number of 
experts K. Here we mainly consider this linear case. 

12 Experimental study 

This section is dedicated to the evaluation of the proposed approach on simulated data 
and real-world data . We evaluated the performance of proposed EM algorithms for 
the NNMoE models in terms of modeling, robustness to outliers and clustering. The 
algorithms have been implemented in Matlab. 

12.1 Initialization and stopping rules 

The parameters (/c = 1,..., 77 — 1) of the mixing proportions are initialized randomly, 
including an initialization at the null vector for one run (corresponding to equal mixing 
proportions). Then, the common parameters {k = 1,... ,K) are initialized from 

a random partition of the data into K clusters. This corresponds to htting a normal 
mixture of experts where the initial values of the parameters are respectively given by 
(8) and (9) with the posterior memberships Tik replaced by the hard assignments Zik 
issued from the random partition. For the TMoE and STMoE models, the robustness 
parameters Uk {k = 1,, K) can be initialized randomly in the range [1, 200]. For the 
SNMoE and STMoE, the skewness parameters \k {k = 1,..., K) can be initialized by 
randomly initializing the parameter 6k in (—1,1) from the relation Xk = Then, 

the proposed E(C)M algorithm for each model is stopped when the relative variation 
of the observed-data log-likelihood 0 caches a prehxed threshold (for 

example e = 10“®). For each model, this process is repeated 10 times and and the solution 
corresponding the highest log-likelihood is hnally selected. 

12.2 An illustrative example 

We hrst start by an illustrative example by considering a non-linear arbitrary data set 
which was analyzed by Bishop and Svensen (2003) and elsewhere. This data set consists 
of 77 = 250 values of input variables Xj generated uniformly in (0,1) and output variables 
Hi generated as Pi = Xi + 0.3 sin(27rxi) -|- e*, with e* drawn from a zero mean Normal 
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distribution with standard deviation 0.05. To apply the MoE models, we set the covariate 
vectors {xi,ri) to Xi = Vi = We considered mixture of three linear experts as in 

Bishop and Svensen (2003). 

Figure 1 shows the expert mean functions of each of the htted MoE models, the 
corresponding partitions obtained by using the Bayes’ rule, and the mixing proportions 
as function of the inputs. One can observe that the four models are successfully applied 
and provide very similar results. The results obtained by the proposed NNMoE models 
are indeed close to the one obtained by the NMoE. 
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Figure 1: Fitting the NMoE model and the three proposed non-normal mixture of experts 
models (SNMoE, TNMoE, STMoE) to the toy data set analyzed in Bishop and Svensen 
(2003). 
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12.3 Experiments on simulation data sets 

In this section we perform an experimental study on simulated data sets to apply and 
assess the proposed models. Two sets of experiments have been performed. The first 
experiment aims at observing the effect of the sample size on the estimation quality and 
the second one aims at observing the impact of the presence of outliers in the data on the 
estimation quality, that is the robustness of the models. 

12.3.1 Experiment 1 

For this first experiment on simulated data, each simulated sample consisted of n obser¬ 
vations with increasing values of the sample size n : 50,100, 200, 500,1000. The simulated 
data are generated from a two component mixture of linear experts, that is iF = 2,p = 
q = 1. The covariate variables {xi,ri) are simulated such that Xi = Vi = (1,0:,)^ where 
Xi is simulated uniformly over the interval (—1,1). We consider each of the four models 
for data generation (NNMoE, SNMoE, TMoE, STMoE), that is, given the covariates, the 
response yi\{xi, r,; '^} is simulated according to the generative process of the models (3), 
(14), (36), and (14). For each generated sample, we fit each of the four models. Thus, 
the results are reported for all the models with data generated from each of the models. 
We consider the mean square error (MSE) between each component of the true parame¬ 
ter vector and the estimated one, which is given by \^j — The squared errors are 

averaged on 100 trials. The used simulation parameters ^ for each model are given in 
Table 1. 



parameters 

component 1 
component 2 

Qi = (0,10)'^' /3i = (0,l)'^' <71= 0.1 Ai = 3 ni = 5 

Q2 = (0,0)^ /32 = (0,-1)^ ct2 = 0.1 A2 = -10 U2 = 7 


Table 1: Parameter values used in simulation. 


12.3.2 Obtained results 

Tables 2, 3, and 4 show the obtained results in terms of the MSE for respectively the 
SNMoE, the TMoE, and the STMoE. One can observe that, for the three proposed mod¬ 
els, the parameter estimation error is decreasing as n increases, which confirms the con¬ 
vergence property of the maximum likelihood estimator For details on the convergence 
property of the MLE for mixture of experts, see for example (Jiang and Tanner, 1999). 
One can also observe that the error decreases significantly for n > 500, especially for the 
regression coefficients and the scale parameters. 
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param. 

n 

aio 

Oil 

/3io 

/3ii 

ho 

hi 

O'! 

(72 

Ai 

h 

50 

1.10105 

4.1882 

0.00916 

0.004890 

0.007370 

0.00348000 

0.001647 

0.002234 

3.000 

4.999 

100 

0.28074 

1.0663 

0.008301 

0.0006118 

0.006360 

0.00007904 

0.001314 

0.001650 

2.999 

5.000 

200 

0.03893 

0.9343 

0.004709 

0.0000398 

0.005962 

0.00005873 

0.001142 

0.001552 

2.999 

5.000 

500 

0.02340 

0.0908 

0.004475 

0.0000195 

0.005803 

0.00000796 

0.001026 

0.001521 

3.000 

4.999 

1000 

0.00025 

0.0613 

0.003912 

0.0000012 

0.005499 

0.00000344 

0.000667 

0.001517 

2.999 

3.999 


Table 2: MSE between each component of the estimated parameter vector of the SNMoE 
model and the actnal one for a varying sample size n. 


param. 

n 

«10 

au 

ho 

hi 

ho 

hi 

(7l 

(72 

Kl 

^2 

50 

1.3059 

6.4611 

0.0214130 

0.0290114 

0.0044140 

0.0192600 

0.0010655 

0.0003317 

37.956 

11.722 

100 

1.2150 

4.5056 

0.0024706 

0.0117546 

0.0005275 

0.0007891 

0.0001450 

0.0002301 

6.1528 

10.412 

200 

0.0341 

3.8193 

0.0001553 

0.0007335 

0.0002022 

0.0005061 

0.0000504 

0.0000262 

2.0975 

6.3710 

500 

0.0356 

2.2633 

0.0000112 

0.0000214 

0.0001337 

0.0002163 

0.0000126 

0.0000007 

0.4859 

5.4937 

1000 

0.0053 

1.2510 

0.0000018 

0.0000258 

0.0000005 

0.0000427 

0.0000126 

0.0000004 

0.0014 

2.7844 


Table 3: MSE between each component of the estimated parameter vector of the TMoE 
model and the actnal one for a varying sample size n. 


param. 

n 

aio 

Qfll 

ho 

hi 

ho 

hi 

(7l 

(72 

Ai 

h 

Vl 

^2 

50 

0.5256 

5.737 

0.000965 

0.002440 

0.004388 

0.000667 

0.000954 

0.000608 

3.1153 

16.095 

15.096 

4.64311 

100 

0.4577 

1.815 

0.000847 

0.000852 

0.000742 

0.000660 

0.000844 

0.000303 

2.0131 

7.844 

5.360 

0 . 2635 ' 

200 

0.2478 

0.785 

0.000816 

0.000348 

0.000473 

0.000556 

0.000362 

0.000297 

0.7004 

3.8470 

3.135 

0.16721 

500 

0.0316 

0.565 

0.000363 

0.000091 

0.000314 

0.000398 

0.000091 

0.000061 

0.0078 

1.0785 

0.2230 

0.00861 

1000 

0.0085 

0.068 

0.000261 

0.000076 

0.000233 

0.000116 

0.000026 

0.000002 

0.0028 

0.5545 

0.0494 

0 . 0007 ! 


Table 4: MSE between each component of the estimated parameter vector of the STMoE 
model and the actnal one for a varying sample size n. 


In addition to the previonsly showed results, we plotted in Figures 2, 3, 4, and 5 the 
estimated quantities provided by applying the proposed models and their true counter¬ 
parts for n = 500 for the same the data set which was generated according the NMoE 
model. The upper-left plot of each of these hgures shows the estimated mean function, 
the estimated expert component mean functions, and the corresponding true ones. The 
upper-right plot shows the estimated mean function and the estimated conhdence region 
computed as plus and minus twice the estimated (pointwise) standard deviation of the 
model as presented in Section 9, and their true counterparts. The middle-left plot shows 
the true expert component mean functions and the true partition, and the middle-right 
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plot shows their estimated counterparts. Finally, the bottom-left plot shows the log- 
likelihood prohle during the EM iterations and the bottom-right plot shows the estimated 
mixing probabilities. 

One can clearly see that the estimations provided by each of the proposed models are 
very close to the true ones which correspond to those of the NMoE model in this case. 
This provides an additional support to the fact that the proposed algorithms perform well 
and the corresponding proposed models are good generalizations of the normal mixture of 
experts (NMoE), as they clearly approach the NMoE as shown in this simulated example. 

Figure 6 shows the true and estimated MoE mean functions and expert mean functions 
by htting the proposed NNMoE models to a simulated data set of n = 500 observations. 
Each model was considered for data generation. The upper plot corresponds to the 
SNMoE model, the middle plot to the TMoE model and the bottom plot to the STMoE 
model. Finally, Figure 7 shows the corresponding true and estimated partitions. Again, 
one can clearly see that both the estimated models are precise. The htted functions are 
close to the true ones. In addition, one can also see that the partitions estimated by 
the NNMoE models are close the actual partitions. The proposed NNMoE models can 
therefore be used as alternative to the NMoE model for both regression and model-based 
clustering. 
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Figure 2: Fitted NMoE model to a data set generated according to the NMoE model. 
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Figure 3: Fitted SNMoE model to a data set generated according to the NMoE model. 
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Figure 4: Fitted TMoE model to a data set generated according to the NMoE model. 
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Figure 5: Fitted STMoE model to a data set generated according to the NMoE model. 
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Figure 6: The true and estimated mean function and expert mean functions by fitting 
the proposed NNMoE models to a simulated data set of n = 500 observations. Up: the 
SNMoE model; Middle: the TMoE model; Bottom: the STMoE model. 
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Figure 7: The true and estimated partitions by fitting the proposed NNMoE models to 
the simulated data set shown in Figure 6. Up: the SNMoE model; Middle: the TMoE 
model; Bottom: the STMoE model. 

12.3.3 Experiment 2 

In this experiment we examine the robustness of the proposed models to outliers versus 
the standard NMoE one. For that, we considered each of the four models (NMoE, SNMoE, 
TMoE, and STMoE) for data generation. For each generated sample, each of the four 
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models in considered for the inference. The data were generated exactly in the same way 
as in Experiment 1, except for some observations which were generated with a probability 
c from a class of ontliers. We considered the same class of ontliers as in Ngnyen and 
McLachlan (2014), that is the predictor x is generated nniformly over the interval (—1,1) 
and the response y is set the valne —2. We apply the MoE models by setting the covariate 
vectors as before, that is, a; = r = (l,a;)^. We considered varying probability of outliers 
c = 0%, 1%, 2%, 3%, 4%, 5% and the sample size of the generated data is n = 500. An 
example of simulated sample containing 5% outliers is shown in Figure 8. As a criterion 
of evaluation of the impact of the outliers on the quality of the results, we considered 
the MSE between the true regression mean function and the estimated one. This MSE is 
calculated as ^ ~ where the expectations are computed 

as in Section 9. 

12.3.4 Obtained results 

Table 5 shows, for each of the fours models, the results in terms of mean squared error 
between the true mean function and the estimated one, for an increasing number of outliers 
in the data. First, one can see that, when there is no outliers (c = 0%), the error of the 
TMoE is less than those of the other models, for the four situations, that is including the 
case where the data are not generated according to the TMoE model, which is somewhat 
surprising. This includes the case where the data aregenrated according to the NMoE 
model, for which the TMoE error is slightly less than the one of the NMoE model. Then, 
it can be seen that when there is outliers, the TMoE model outperforms the other models 
for almost all the situations, except the one in which the data are generated according to 
the STMoE model. When the data do not contain outliers and are generated from the 
STMoE, this one indeed outperforms the NMoE and SNMoE models. For the situation 
when there is no outliers and the data are generated according to the TMoE or the 
STMoE, these two models may provide quasi-identical results. In the case of presence of 
outliers in data generated from the STMoE, this one outperforms the NMoE and SNMoE 
models for all the situations, and outperforms the TMoE for the majority of situations, 
namely when the number of the outliers is more than 2%. One can also see that, for all the 
situations with outliers, as expected, the TMoE and STMoE models always provide the 
best results. These two models are indeed much more robust to outliers compared to the 
normal and skew-normal ones because the expert components in these two models follow 
a robust distribution, that is the t distribution for the TMoE, and the skew t distribution 
for the STMoE. The NMoE and SNMoE are sensitive to outliers. When there is outliers, 
the SNMoE behavior is comparable to the one of the NMoE. However, when the number 
of outliers is increasing, it can be seen that the increase in the error of the NMoE and 
SNMoE model is more pronounced compared to the one of the TMoE and STMoE models. 
The error for both the TMoE and STMoE may indeed slightly increase, remain stable or 
even decrease in some situations. This supports the expected robustness of the TMoE 
and STMoE models. 
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c 

0% 

1% 

2% 

3% 

4% 

5% 


Model 







w 

o 

NMoE 

0.0001783 

0.001057 

0.001241 

0.003631 

0.013257 

0.028966 

SNMoE 

0.0001798 

0.003479 

0.004258 

0.015288 

0.022056 

0.028967 


TMoE 

0.0001685 

0.000566 

0.000464 

0.000221 

0.000263 

0.000045 


STMoE 

0.0002586 

0.000741 

0.000794 

0.000696 

0.000697 

0.000626 

o 

NMoE 

0.0000229 

0.000403 

0.004012 

0.002793 

0.018247 

0.031673 

SNMoE 

0.0000228 

0.000371 

0.004010 

0.002599 

0.018247 

0.031674 


TMoE 

0.0000325 

0.000089 

0.000130 

0.000513 

0.000108 

0.000355 


STMoE 

0.0000562 

0.000144 

0.000022 

0.000268 

0.000152 

0.001041 

W 

o 

NMoE 

0.0002579 

0.0004660 

0.002779 

0.015692 

0.005823 

0.005419 

SNMoE 

0.0002587 

0.0004659 

0.006743 

0.015686 

0.005835 

0.004813 


TMoE 

0.0002529 

0.0002520 

0.000144 

0.000157 

0.000488 

0.000045 


STMoE 

0.0002473 

0.0002451 

0.000173 

0.000176 

0.000214 

0.000291 

o 

NMoE 

0.000710 

0.0007238 

0.001048 

0.006066 

0.012457 

0.031644 

SNMoE 

0.000713 

0.0009550 

0.001045 

0.006064 

0.012456 

0.031644 


TMoE 

0.000279 

0.0003808 

0.000371 

0.000609 

0.000651 

0.000609 

cyj 

STMoE 

0.000280 

0.0001865 

0.000447 

0.000600 

0.000509 

0.000602 


Table 5: MSE between the estimated mean function and the true one for each of the 
four models for a varying probability c of outliers for each simulation. The hrst column 
indicates the model used for generating the data and the second one indicates the model 
used for inference. 


To highlight the robustness to noise of the TMoE and STMoE models, in addition to 
the previously shown numerical results, hgures 8, 9, 10, and 11 show an example of results 
obtained on the same data set by, respectively, the NMoE, the SNMoE, TMoE, and the 
STMoE. The data are generated by the NMoE model and contain c = 5% of outliers. 

In this example, we clearly see that both the NMoE model and the SNMoE are severely 
affected by the outliers. They provide a rough £t especially for the second component 
whose estimation is affected by the outliers. However, one can see that both the TMoE 
and the STMoE model clearly provide a precise fit; the estimated mean functions and 
expert components are very close to the true ones. The TMoE and the STMoE are robust 
to outliers, in terms of estimating the true model as well as in terms of estimating the 
true partition of the data (as shown in the middle plots of the data). Notice that for the 
TMoE and the STMoE, the confidence regions are not shown because for this situation the 
estimated degrees of freedom are less than 2 (1.5985 and 1.5253 for the TMoE, and 1.6097 
1.5311 for the STMoE); Hence the variance for these models in that case is not dehned 
(see Section 9). The TMoE and STMoE models provide indeed components with small 
degrees of freedom corresponding to highly heavy tails, which allow to handle outliers in 
this noisy case. 


42 








































EM iteration number 



Figure 8: Fitted NMoE model to a data set of n = 500 observations generated according 
to the NMoE model and including 5% of outliers. 
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Figure 9: Fitted SNMoE model to a data set of n = 500 observations generated according 
to the NMoE model and including 5% of outliers. 
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Figure 10: Fitted TMoE model to a data set of u = 500 observations generated according 
to the NMoE model and including 5% of outliers. 
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Figure 11: Fitted STMoE model to a data set of n = 500 observations generated according 
to the NMoE model and including 5% of outliers. 

12.4 Application to two real-world data sets 

In this section, we consider an application to two real-world data sets: the tone perception 
data set and the temperature anomalies data set shown in Figure 12. 
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Tone perception data 


Temperature anomaly data 
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Figure 12: Scatter plots of the tone perception data and the temperature anomalies data. 

12.4.1 Tone perception data set 

The hrst analyzed data set is the real tone perception data set^ which goes back to 
Cohen (1984). It was recently studied by Bai et ah (2012) and Song et ah (2014) by 
using robust regression mixture models based on, respectively, the t distribution and the 
Laplace distribution. In the tone perception experiment, a pure fundamental tone was 
played to a trained musician. Electronically generated overtones were added, determined 
by a stretching ratio (“stretch ratio” = 2) which corresponds to the harmonic pattern 
usually heard in traditional dehnite pitched instruments. The musician was asked to tune 
an adjustable tone to the octave above the fundamental tone and a “tuned” measurement 
gives the ratio of the adjusted tone to the fundamental. The obtained data consists of n = 
150 pairs of “tuned” variables, considered here as predictors {x), and their corresponding 
“strech ratio” variables considered as responses (y). To apply the proposed MoE models, 
we set the response yi{i = 1,..., 150) as the “strech ratio” variables and the covariates 
Xi = ri = (Cxj)^ where Xi is the “tuned” variable of the zth observation. We also follow 
the study in Bai et ah (2012) and Song et ah (2014) by using two mixture components. 
Model selection results are given later in Table 7. 

Figure 13 shows the scatter plots of the tone perception data and the linear expert 
components of the htted NMoE model and the proposed SNMoE, TMoE, and STMoE 
models. One can observe that we obtain a good ht with all the models. The NMoE and 
SNMoE are quasi-identical, and differ very slightly from those of the TMoE and STMoE, 
which are very similar. The two regression lines may correspond to correct tuning and 
tuning to the hrst overtone, respectively, as analyzed in Bai et ah (2012). Figure 14 shows 
the log-likelihood prohles for each of the four models. It can namely be seen that training 
the t mixture of experts for this experiment may take more iterations than the normal 
and the skew-normal MoE models. The STMoE has indeed more parameters to estimate 
than the other ones. However, in terms of computing time, all the models converge in 

^Source: http;//artax.karlin.mff.cuni.cz/r-help/library/fpc/html/tonedata.html 
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Figure 13: The fitted MoLE to the original tone data set. Upper-left: NMoE model, 
Upper-right: SNMoE model, Bottom-left: TMoE model. Bottom-right: STMoE model. 
The predictor x is the actual tone ratio and the response y is the perceived tone ratio. 
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only few seconds on a personal laptop (withe 2,9 GHz processor and and 8 GB memory). 
The values of estimated parameters for the tone perception data set are given in Table 
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Figure 14: The log-likelihood during the EM iterations when fitting the MoLE models 
to the original tone data set. Upper-left: NMoE model, Upper-right: SNMoE model, 
Bottom-left: TMoE model. Bottom-right: STMoE model. 

6. One can see that the regression coefficients are very similar for all the models, except 
for the hrst component of the TMoE model. This can be observed on the fit in Figure 13 
where the first expert component for the TMoE model slightly differ from the one of the 
other ones. One can also see that the SNMoE model parameters are identical to those of 
the NMoE, with a skewness close to zero. For the STMoE model, it retrieves a skewed 
component and with high degrees of freedom compared to the other component. This 
component may be seen as approaching the one of the SNMoE model, while the second 
one in approaching a t distribution, that is the one of the TMoE model. 

We also performed a model selection procedure on this data set to choose the best 
number of MoE components for a number of components between 1 and 5. We used 
BIG, AIC, and ICL. Table 7 gives the obtained values of the model selection criteria. 
One can see that for the NMoE model overestimate the number oc components. AIC 
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param. oio 

model 

an dio 

Pn 

P20 

/321 

Ul 

(T2 Ai 

A2 

Ul 

V2 

NMoE -2.690 

0.796 -0.029 

0.995 

1.913 

0.043 

0.137 

0.047 

- 

- 

- 

SNMoE -2.694 

0.797 -0.029 

0.995 

1.913 

0.043 

0.137 

0.047 5.2e-13 

-1.65e-13 

- 

- 

TMoE -0.058 

-0.070 0.002 

0.999 

1.956 

0.027 

0.002 

0.029 

- 

0.555 

2.017 

STMoE -3.044 

0.824 -0.058 

0.944 

1.944 

0.032 

0.200 

0.032 93.386 

-0.011 

19.070 

1.461 


Table 6: Values of the estimated MoE parameters for the original Tone perception data 
set. 


performs poorly for all the models. BIC provides the correct number of components for 
the three proposed models. ICL too estimated the correct number of components for 
both the SNMoE and STMoE models, but hesitates between 2 (the correct number) and 
3 components for the TMoE model. One can conclude that the BIC is the criterion to be 
suggested for the analysis. 



NMoE 

SNMoE 

TMoE 

STMoE 

K 

BIC AIC ICL 

BIC AIC ICL 

BIC AIC ICL 

BIC AIC ICI 

1 

1.8662 6.3821 1.8662 

-0.6391 5.3821 -0.6391 

71.3931 77.4143 71.3931 

69.5326 77.0592 69.53 

2 

122.8050 134.8476 107.3840 

117.7939 132.8471 102.4049 

204.8241 219.8773 186.8415 

92.4352 110.4990 82.45 

3 

118.1939 137.7630 76.5249 

122.8725 146.9576 98.0442 

199.4030 223.4880 183.0389 

77.9753 106.5764 52.56 

4 

121.7031 148.7989 94.4606 

109.5917 142.7087 97.6108 

201.8046 234.9216 187.7673 

77.7092 116.8474 56.36 

5 

141.6961 176.3184 123.6550 

107.2795 149.4284 96.6832 

187.8652 230.0141 164.9629 

79.0439 128.7194 67.74 


Table 7: Choosing the number of expert components K for the original tone perception 
data by using the information criteria BIC, AIC, and ICL. Underlined numbers indicate 
the highest value for each criterion. 


Now we examine the sensitivity of the MoE models to outliers based on this real data 
set. For this, we adopt the same scenario used in Bai et ah (2012) and Song et ah (2014) 
(the last and more difficult scenario) by adding 10 identical pairs (0,4) to the original 
data set as outliers in the ^/-direction, considered as high leverage outliers. We apply the 
MoE models in the same way as before. 

The upper plots in Figure 15 show that the normal and the skew-normal mixture of 
experts provide almost identical hts and are sensitives to outliers. However, in both cases, 
compared to the normal regression mixture result in Bai et ah (2012), and the Laplace 
regression mixture and the t regression mixture results in Song et ah (2014), the htted 
NMoE and SNMoE model are affected less severely by the outliers This may be attributed 
to the fact that the mixing proportions here are depending on the predictors, which is 
not the case in these regression mixture models, namely the ones of Bai et ah (2012), 
and Song et ah (2014). One can also see that, even the regression mean functions are 
affected severely by the outliers, the provided partitions are still reasonable and similar 
to those provided in the previous non-noisy case. Then, the bottom plots in Figure 15 
clearly show that the TMoE and the STMoE provide a robust good ht. For the TMoE, 
the obtained ht is quasi-identical to the hrst one on the original data without outliers. 
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shown in the bottom-left plot of Figure 13. For the STMoE, even if the results differ very 
slightly compared to the case with outliers, the obtained hts for both situations (with 
and without outliers) are very reasonable. Moreover, we notice that, as showed in Song 
et ah (2014), for this situation with outliers, the t mixture of regressions fails; The £t is 
affected severely by the outliers. However, for the proposed TMoE and STMoE, the ten 
high leverage outliers have no signihcant impact on the htted experts. This is because 
here the mixing proportions depend on the inputs, which is not the case for the regression 
mixture model described in Song et al. (2014). Figure 16 shows the log-likelihood prohles 


NMoE 



TMoE 



SNMoE 



STMoE 



Figure 15: Fitting MoLE to the tone data set with ten added outliers (0,4). Upper-left: 
NMoE model. Upper-right: SNMoE model. Bottom-left: TMoE model. Bottom-right: 
STMoE model. The predictor x is the actual tone ratio and the response y is the perceived 
tone ratio. 

for each of the four models, which show a similar behavior than the one in the case without 
outliers. 

The values of estimated MoE parameters in this case with outliers are given in Table 8. 
One can see that the SNMoE model parameters are identical to those of the NMoE, with 
a skewness close to zero. The regression coefficients for the second expert component 
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NMoE 


SNMoE 




TMoE STMoE 



Figure 16: The log-likelihood during the EM iterations when htting the MoLE models to 
the tone data set with ten added outliers (0,4). Upper-left: NMoE model, Upper-right: 
SNMoE model. Bottom-left: TMoE model. Bottom-right: STMoE model. 
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are very similar for all the models. For the hrst component, the TMoE model is still 
retrieving a more heavy tailed component. For the STMoE model, it retrieves a skewed 
normal component while the second component in approaching a t distribution with a 
small degrees of freedom. 


param. oio an 

model 

/3io /3ii 

ho 

hi 


<72 

Ai 

h 


V2 

NMoE 0.811 0.150 

3.117 -0.285 

1.907 

0.046 

0.700 

0.050 

- 

- 

- 

- 

SNMoE -0.810 -0.150 

3.117 -0.285 

1.907 

0.046 

0.701 

0.050 

5.5e-08 

le-08 

- 

- 

TMoE 0.888 -0.236 

0.002 0.999 

1.971 

0.020 

0.002 

0.024 

- 

- 

0.682 

0.812 

STMoE -3.004 0.732 

-0.246 1.016 

1.808 

0.060 

0.212 

0.088 

156.240 

1.757 

81.355 

1.630 


Table 8: Values of the estimated MoE parameters for the tone perception data set with 
added outliers. 


12.4.2 Temperature anomalies data set 

In this experiment, we examine another real-world data set related to climate change 
analysis. The NASA GISS Surface Temperature (GISTEMP) analysis provides a measure 
of the changing global surface temperature with monthly resolution for the period since 
1880, when a reasonably global distribution of meteorological stations was established. 
The GISS analysis is updated monthly, however the data presented here^ are updated 
annually as issued from the Carbon Dioxide Information Analysis Center (CDIAC), which 
has served as the primary climate-change data and information analysis center of the U.S. 
Department of Energy since 1982. The data consist of u = 135 yearly measurements of 
the global annual temperature anomalies (in degrees C) computed using data from land 
meteorological stations for the period of 1882 — 2012. These data have been analyzed 
earlier by Hansen et ah (1999, 2001) and recently by Nguyen and McLachlan (2014) by 
using the Laplace mixture of linear experts (LMoLE). 

To apply the proposed non-normal mixture of expert models, we consider mixtures 
of two experts as in Nguyen and McLachlan (2014). This number of components is also 
the one provided by the model selection criteria as shown later in Table 10. Indeed, as 
mentioned by Nguyen and McLachlan (2014), Hansen et ah (2001) found that the data 
could be segmented into two periods of global warming (before 1940 and after 1965), 
separated by a transition period where there was a slight global cooling (i.e. 1940 to 
1965). Documentation of the basic analysis method is provided by Hansen et ah (1999, 
2001). We set the response yi{i = 1,..., 135) as the temperature anomalies and the 
covariates Xi = Vi = (l,Xj)^ where Xi is the year of the ith observation. 

Figures 17, 18, and 19 respectively show, for each of the MoE models, the two htted 
linear expert components, the corresponding means and conhdence regions computed as 
plus and minus twice the estimated (pointwise) standard deviation as presented in Section 
9, and the log-likelihood prohles. One can observe that the four models are successfully 

^source: from Ruedy et at, http://cdiac.ornl.gov/ftp/trends/temp/hansen/gl_land.txt 
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applied on the data set and provide very similar results. These results are also similar 
to those found by Nguyen and McLachlan (2014) who used a Laplace mixture of linear 
experts. The values of estimated MoE parameters for the temperature anomalies data 

NMoE SNMoE 




TMoE STMoE 




Figure 17: Fitting the MoLE models to the temperature anomalies data set. Upper-left: 
NMoE model, Upper-right: SNMoE model, Bottom-left: TMoE model. Bottom-right: 
STMoE model. The predictor x is the year and the response y is the temperature anomaly. 

set are given in Table 9. One can see that the parameters common for the all models are 
quasi-identical. It can also be seen that the SNMoE model provides s a £t with a skewness 
very close to zero. Similarly, the STMoE model provide a solution with a skewness close 
to zero. This may support the hypothesis of non-asymmetry for this data set. Then, both 
the TMoE and STMoE hts provide a degrees of freedom more than 17, which tends to 
approach a normal distribution. On the other hand, the regression coefficients are also 
similar to those found by Nguyen and McLachlan (2014) who used a Laplace mixture of 
linear experts. 

We performed a model selection procedure on the temperature anomalies data set to 
choose the best number of MoE components from values between 1 and 5. Table 10 gives 
the obtained values of the used model selection criteria, that is BIG, AIC, and ICL. One 
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NMoE SNMoE 




Figure 18: The fitted MoLE models to the temperature anomalies data set. Upper-left: 
NMoE model, Upper-right: SNMoE model, Bottom-left: TMoE model. Bottom-right: 
STMoE model. The predictor x is the year and the response y is the temperature anomaly. 
The shaded region represents plus and minus twice the estimated (pointwise) standard 
deviation as presented in Section 9. 


param. 

model 

oio ail dio 

/3ii /32 o 

/321 

(71 

(72 

Ai 

A 2 


T72 

NMoE 

946.483 -0.481 -12.805 

0.006 -41.073 

0.020 

0.115 

0.110 

- 

- 

- 

- 

SNMoE 

950.950 -0.484 -12.805 

0.006 -41.074 

0.020 

0.115 

0.110 

-8.7e-13 

-9.2e-13 

- 

- 

TMoE 

947.225 -0.482 -12.825 

0.006 -41.008 

0.020 

0.114 

0.108 

- 

- 

70.828 

54.383 

STMoE 

931.966 -0.474 -12.848 

0.006 -40.876 

0.020 

0.113 

0.105 

0.024 

-0.015 

41.048 

17.589 


Table 9: Values of the estimated MoE parameters for the temperature anomalies data 
set. 
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Figure 19: The log-likelihood during the EM iterations when fitting the MoLE models 
to the temperature anomalies data set. Upper-left: NMoE model, Upper-right: SNMoE 
model, Bottom-left: TMoE model. Bottom-right: STMoE model. 
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can see that, except the result provided by AIC for the NMoE model which provide a 
high number of components, all the others results provide evidence for two components 
in the data. 



NMoE 

SNMoE 

TMoE 

STMoE 

K 

BIC AIC ICL 

BIC AIC ICL 

BIC AIC ICL 

BIC AIC ICL 

1 

46.0623 50.4202 46.0623 

43.6096 49.4202 43.6096 

43.5521 49.3627 43.5521 

40.9715 48.2347 40.9715 

2 

79.9163 91.5374 79.6241 

75.0116 89.5380 74.7395 

74.7960 89.3224 74.5279 

69.6382 87.0698 69.3416 

3 

71.3963 90.2806 58.4874 

63.9254 87.1676 50.8704 

63.9709 87.2131 47.3643 

54.1267 81.7268 30.6556 

4 

66.7276 92.8751 54.7524 

55.4731 87.4312 41.1699 

56.8410 88.7990 45.1251 

42.3087 80.0773 20.4948 

5 

59.5100 92.9206 51.2429 

45.3469 86.0207 41.0906 

43.7767 84.4505 29.3881 

28.0371 75.9742 -8.8817 


Table 10: Choosing the number of expert components K for the temperature anomalies 
data by using the information criteria BIC, AIC, and ICL. Underlined numbers indicate 
the highest value for each criterion. 


13 Conclusion and future work 

In this paper, we proposed new non-normal MoE models, which generalize the normal 
MoE. They are based on the skew-normal, t and skew t distribution and are respectively 
the SNMoE, TMoE, and STMoE. The SNMoE model is suggested for non-symmetric data, 
the TMoE for data with possibly outliers and heavy tail, and the STMoE is suggested 
for both possibly non-symmetric, heavy tailed and noisy data. We developed EM-type 
algorithms to infer each of the proposed models and described the use of the models in 
non-linear regression and prediction as well as in model-based clustering. The developed 
models are successfully applied on simulated and real data sets. The results obtained on 
simulated data conhrm the good performance of the models in terms of density estimation, 
non-linear regression function approximation and clustering. In addition, the simulation 
results provide evidence of the robustness of the TMoE and STMoE models to outliers, 
compared to the normal alternative models. The proposed models were also successfully 
applied to two different real data sets, including a situation with outliers. The model 
selection using information criteria tends to promote using BIC against in particular 
AIC which may perform poorly in the analyzed data. The obtained results support the 
potential beneht of the proposed approaches for practical applications. 

In this paper, we only considered the MoE in their standard (non-hierarchical) version. 
One interesting future direction is therefore to extend the proposed models to the hierar¬ 
chical mixture of experts framework (Jordan and Jacobs, 1994). Furthermore, a natural 
future extension of this work is to consider the case of MoE for multiple regression on 
multivariate data rather than simple regression on univariate data. 
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